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r Abstract 

o 

This paper addresses the problem of finding a B-term wavelet representation of a given discrete function / G 72" 
whose distance from / is minimized. The problem is well understood when we seek to minimize the Euclidean 
distance between / and its representation. The first known algorithms for finding provably approximate representations 

*vj minimizing general £ p distances (including 4x>) under a wide variety of compactly supported wavelet bases are 

presented in this paper. For the Haar basis, a polynomial time approximation scheme is demonstrated. These algorithms 

i | are applicable in the one-pass sublinear-space data stream model of computation. They generalize naturally to multiple 

dimensions and weighted norms. A universal representation that provides a provable approximation guarantee under 

all p-norms simultaneously; and the first approximation algorithms for bit-budget versions of the problem, known 

as adaptive quantization, are also presented. Further, it is shown that the algorithms presented here can be used to 

select a basis from a tree-structured dictionary of bases and find a B-term representation of the given function that 

provably approximates its best dictionary-basis representation. 
> 

^^ Index Terms 

o 
Tf 

o 

A central problem in approximation theory is to represent a function concisely. Given a function or a signal 

as input, the goal is to construct a representation as a linear combination of several predefined functions, under a 

O 

constraint which limits the space used by the representation. The set of predefined functions are denoted as the 



Nonlinear approximation, compactly supported wavelets, transform coding, best basis selection. 



dictionary. One of the most celebrated approaches in this context has been that of nonlinear approximation. In this 



approach the dictionary elements that are used to represent a function are allowed to depend on the input signal 
itself. 

Nonlinear approximations has a rich history starting from the work of Schmidt [2]; however, more recently these 
have come to fore in the context of wavelet dictionaries [3], [4]. Wavelets were first analyzed by DeVore et. al. 
[5] in nonlinear approximation. Wavelets and multi-fractals have since found extensive use in image representation, 
see Jacobs [6], In fact, the success of wavelets in nonlinear approximation has been hailed by many researchers as 
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"the 'true' reason of the usefulness of wavelets in signal compression" (Cohen et. al. [7]). Due to lack of space 
we would not be able to review the extremely rich body of work that has emerged in this context; see the surveys 
by DeVore [8] and Temlyakov [9] for substantial reviews. 

However, with the rise in the number of domains for which wavelets have been found useful, several interesting 
problems have arisen. Classically, the error in terms of representation has been measured by the Euclidean or £2 
error. This choice is natural for analysis of functions, but not necessarily for representation of data and distributions. 
Even in image compression, Mallat [3, p. 528] and Daubechies [4, p. 286] point out that while the £2 measure 
does not adequately quantify perceptual errors, it is used, nonetheless, since other norms are difficult to optimize. 
However, non-^2 measures have been widely used in the literature. Matias, Vitter and Wang [10], suggested using 
the t\ metric and showed that wavelets could be used in creating succinct synopses of data allowing us to answer 
queries approximately. The £\ distance is a statistical distance and is well suited for measuring distributions. 
Interestingly, Chapelle, Haffner and Vapnik [11] show that the £\ norm significantly outperforms the £2 norm in 
image recognition on images in the Corel data set using SVM's. From a completely different standpoint, we may 
be interested in approximating a signal in the l^ norm thus seeking a high fidelity approximation throughput rather 
than an 'average' measure such as other norms. This is particularly of interest if we are trying to process noisy 



data (we consider £\,loo approximations in Section III-Ci. While we have developed a reasonable understanding 



of £2 error, problems involving non-^ 2 error are still poorly understood. This paper takes the first steps towards 
filling this gap. 

One of the most basic problems in nonlinear approximation is the following: Given a wavelet basis {ijji} and a 
target function (or signal, vector) / € TV 1 , construct a representation / as a linear combination of at most B basis 
vectors so as to minimize some normed distance between / and /. The B-term representation / belongs to the 
space Tb = {Sr=i z ^i ■ Zi €lZ, \\z\\q < B}, where ||z||o is the number of non-zero coefficients in z £ lZ n . The 
problem is well-understood if the error of the representation is measured using the Euclidean or £2 distance. Since 
the £2 distance is preserved under rotations, by Parseval's theorem, we have 

Zi) ■ 



ii/ - /ill = E (/< - E, *^MJ = E «/>^ 



It is clear then that the solution under this error measure is to retain the largest B inner products (f,ipi), which 
are also the coefficients of the wavelet expansion of /. Note: the fact that we have to store the inner products or 
the wavelet coefficients is a natural consequence of the proof of optimality. 

The common strategy for the B-term representation problem in the literature has been "to retain the [B] terms in 
the wavelet expansion of the target function which are largest relative to the norm in which error of approximation 
is to be measured" [8, p. 4]. This strategy is reasonable in an extremal setting; i.e., if we are measuring the rate 
of the error as a function of B. But it is easy to show that the common greedy strategy is sub-optimal, see 
[12]— [17]. In light of this, several researchers [13]— [15], [17], [18] considered a restricted version of the problem 
under the Haar basis where we may only choose wavelet coefficients of the data. However to date, the only bound 
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on its performance with respect to the target function's best possible representation using B terms from the wavelet 
basis is given by Temlyakov [19] (see also [9, Sec. 7]). Temlyakov shows that given / in the (infinite dimensional) 
Banach function space L p [0, 1], 1 < p < oo, if the given basis {?/>i}iez is L p -equivalent to the Haar basis [20], then 
the error of the common greedy strategy is an a factor away from that of the optimal B-term representation. The 
factor a depends on p and properties of {ipi}, but the dependence is unspecified. However, from an optimization 
point of view in the finite-dimensional setting, the relationship between the factor a and the dimension n of the 
space spanned is the key problem, which we address here. Three relevant questions arise in this context. First 
is whether there are universal algorithms/representations that simultaneously approximate all £ p norms. This is 
important because in many applications, it is difficult to determine the most suitable norm to minimize without 
looking at the data, and an universal representation would be extremely useful. The second question concerns the 
complexity of representing the optimal solution. It is not immediate a priori that the optimal unrestricted solution 
minimizing, for example, the £$ norm for a function that takes only rational values can be specified by B rational 
numbers. The third related question pertains to the computational complexity of finding the optimum solution. Can 
the solution be found in time polynomial in the size of the input n? Or better yet, can the solution be found in 
strongly polynomial time where the running time of the algorithm does not depend on the numeric values of the 
input. We focus on these questions using the lens of approximation algorithms, where we seek to find a solution 
that is close to the optimum — in fast polynomial time. Note that the use of approximation algorithms does not limit 
us from using additional heuristics from which we may benefit, but gives us a more organized starting point to 
develop heuristics with provable bounds. 

A natural generalization of the problem above is known as Adaptive Quantization. The _B-term representation 
requires storing 2B numbers, the coefficient and the index of the corresponding basis vector to be retained. The 
actual cost (in bits) of storing the real numbers z% is, however, non-uniform. Depending on the scenario, it may 
be beneficial to represent a function with a large number of low-support vectors with low precision z/s or a few 
vectors with more detailed precision Zi's. Hence, a _B-term representation algorithm does not translate directly 
into a practical compression algorithm. A natural generalization, and a more practical model as noted in [7], is to 
minimize the error subject to the constraint that the stored values and indices cannot exceed a given bit-budget. 
Note that, again, we are not constrained here to storing wavelet expansion coefficients. This bit-budget version of 
the problem is known as adaptive quantization, which we will also consider. To the best of our knowledge, there 
are no known approximation algorithms for this problem. 

One other natural generalization incorporates a choice of basis into the optimization problem [8]. We are given 
a dictionary T> of bases and our objective is to choose a best basis in T> for representing / using B terms. This 
bi-criteria optimization problem is a form of highly nonlinear approximation [8]. In a seminal work, Coiffman 
and Wickerhauser [21] construct a binary tree-structured dictionary composed of O(nlog n) vectors and containing 
2°(f ) orthonormal bases. They present a dynamic programming algorithm that in O(nlogn) time finds a best basis 
minimizing the entropy of its inner products with the given function /. Mallat [3] discusses generalizations based 
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on their algorithm for finding a basis from the tree dictionary that minimizes an arbitrary concave function of its 
expansion coefficients. However, finding a basis in T> that minimizes a concave function of its inner products with 
the given / is not necessarily one with which we can best represent / (in an £ p sense) using B terms. Combining 
our approximation algorithms for the original B-term representation problem with the algorithm of Coiffman and 
Wickerhauser, we show how one can construct provably-approximate _B-term representations in tree-structured 
wavelet dictionaries. Several of these results also extend to arbitrary dictionaries with low coherence [22], [23]. 

Along with the development of richer representation structures, in recent years there has been significant increase 
in the data sets we are faced with. At these massive scales, the data is not expected to fit the available memory 
of even fairly powerful computers. One of the emergent paradigms to cope with this challenge is the idea of data 
stream algorithms. In a data stream model the input is provided one at a time, and any input item not explicitly 
stored is inaccessible to the computation, i.e., it is lost. The challenge is to perform the relevant computation in 
space that is sublinear in the input size; for example, computing the best representation of a discrete signal f(i) 
for % <G [n] that is presented in increasing order of i, in only o(n) space. This is a classic model of time-series data, 
where the function is presented one value at a time. It is immediate that under this space restriction we may not be 
able to optimize our function. This harks back to the issue raised earlier about the precision of the solution. Thus, 
the question of approximation algorithms is doubly interesting in this context. The only known results on this topic 
[24], [25] crucially depend on Parseval's Identity and do not extend to norms other than £2. 

In summary, even for the simplest possible transform coding problem, namely the B-term representation problem, 
we can identify the following issues: 

• There are no analysis techniques for £ p norms. In fact this is the bottleneck in analyzing any generalization 
of the B-term representation problem; e.g., the adaptive quantization problem. 

• All of the (limited) analyses in the optimization setting have been done on the Haar system, which although 
important, is not the wavelet of choice in some applications. Further, in this setting, the bounds on the 
performance of the algorithms used in practice which retain wavelet coefficients are unclear. 

• Signals that require transform coding are often presented as a streaming input — no algorithms are known except 
for £2 norms. 

• The computational complexity of transform coding problems for structured dictionaries, or even for wavelet 
bases, is unresolved. 

A. Our Results 

We ameliorate the above by showing: 

1) For the B-term representation problem we show that, 

a) The restricted solution that retains at most B wavelet coefficients is a O(logn) approximation to 
the unrestricted solution under all £ p distances for general compact systems (e.g. Haar, Daubechies, 
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Symmlets, Coiflets, among others. 1*1 We provide a 0(B+logn) space and 0(n) time one-pass algorithm 
in the data stream model. We give a modified greedy strategy, which is not normalization, but is similar to 
some scaling strategies used in practice. Our strategy demonstrates why several scaling based algorithms 
used in practice work well. 

b) A surprising consequence of the above is an universal representation using 0(B log n) coefficients that 
simultaneously approximate the signal for all l v distances up to O(logn). 

c) The unrestricted optimization problem has a fully polynomial-time approximation scheme (FPTAS) for 
all £ p distances in the Haar system, that is, the algorithm runs in time polynomial in B,e,n. The 
algorithm is one-pass, b' space and n p time for £ p distances. Therefore, the algorithm is a streaming 
algorithm with sublinear space for p > 1. For i^, the algorithm runs in polylog space and linear timqj 

d) For more general compactly supported systems we display how our ideas yield a quasi-polynomial 
time approximation scheme (QPTASrl This result is in contrast to the case of an arbitrary dictionary 
which, as we already mentioned, is hard to approximate to within any constant factor even allowing 
quasi-polynomial timq^j 

e) The results extend to fixed dimensions and workloads with increases in running time and space. 

2) In terms of techniques, we introduce a new lower bounding technique using the basis vectors {ipi}, which 
gives us the above result regarding the gap between the restricted and unrestricted versions of the problem. 
We also show that bounds using the scaling vectors {<fii} are useful for these optimization problems and, 
along with the lower bounds using {ipi}, give us the approximation schemes. To the best of our knowledge, 
this is the first use of both the scaling and basis vectors to achieve such guarantees. 

3) We show that the lower bound for general compact systems can be extended to an approximation algorithm 
for adaptive quantization. This is the first approximation algorithm for this problem. 

4) For tree-structured dictionaries composed of the type of compactly supported wavelets we consider, our 
algorithms can be combined with the dynamic programming algorithm of Coiffman and Wickerhauser [21] 
to find a B-term representation of the given /. The i v error of the representation we construct provably 
approximates the error of a best representation of / using B terms from a basis in the dictionary. 

The key technique used in this paper is to lower bound the solution based on a system of linear equations but 
with one non-linear constraint. This lower bound is used to set the 'scale' or 'precision' of the solution, and we 
show that the best solution respecting this precision is a near optimal solution by 'rounding' the components of the 
optimal solution to this precision. Finally the best solution in this class is found by a suitable dynamic program 

* This statement differs from the statement in the extremal setting that says that discarding all coefficients below r introduces 0(t log n) 
error, since the latter does not account for the number of terms. 

tFor clarity here, we are suppressing terms based on logn, B, and e. The exact statements appear in Theorems 16 and 18 
f This implies that the running time is 2°( log "' for some constant c (c = 1 gives polynomial time). 
§ Follows from the result of Feige [26]. 
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adapted to the data stream setting. 

We believe that approximation algorithms give us the correct standpoint for construction of approximate repre- 
sentations. The goal of approximation theory is to approximate representation; the goal of approximation algorithms 
is to approximate optimization. Data stream algorithms are inherently approximate (and often randomized) because 
the space restrictions force us to retain approximate information about the input. These goals, of the various uses 
of the approximation, are ultimately convergent. 

Organization: We begin by reviewing some preliminaries of wavelets. In Section III] we present our greedy 



approximation which also relates the restricted to the unrestricted versions of the problem. Section III presents 



applications of the greedy algorithm; namely, an approximate universal representation, approximation algorithms 



for adaptive quantization, and examples illustrating the use of non-^ 2 norms for image representations. Section IV 
is the main section of the paper wherein we present our approximation schemes. We detail the FPTAS for the Haar 
system and show its extensions to multiple dimensions and workloads. We subsequently demonstrate in Section |V| 
how the same ideas translate to a FPTAS for multi-dimensional signals and workloads, and a QPTAS under more 



general compactly supported wavelets. In Section VI we present the tree-structured best-basis selection algorithm. 



Finally, in Section VII we display some experimental results contrasting the performance of an optimal algorithm 



that is restricted to choosing Haar expansion coefficients with our Haar FPTAS. 

I. Preliminaries 
The problem on which we mainly concentrate is the following: 

Problem 1 (B-term Representation): Given / G lZ n , p G [l,oo], a compactly-supported wavelet basis for lZ n 
{ipi}2 = i, and an integer B, find a solution {z;}" =1 , Z{ G 1Z, with at most B non-zero components such that 
\\f ~Yli ZifaWp is minimized. 

We will often refer to this problem as the unrestricted i?-term representation problem in order to contrast it 
with a restricted version where the non-zero components of the solution can only take on values from the set 
{(f,ij}i),i G [n]}. That is, in the restricted version, each Z{ can only be set to a coefficient from the wavelet 
expansion of /, or zero. 

A. Data Streams 

For the purpose of this paper, a data stream computation is a space bounded algorithm, where the space is 
sublinear in the input. Input items are accessed sequentially and any item not explicitly stored cannot be accessed 
again in the same pass. In this paper we focus on one pass data streams. We will assume that we are given numbers 
/ = /(l), . . . , f(i), ■ ■ • , f(n) which correspond to the signal / to be summarized in the increasing order of i. 
This model is often referred to as the aggregated model and has been used widely [24], [27], [28]. It is specially 
suited to model streams of time series data [29], [30] and is natural for transcoding a single channel. Since we 
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focus on dyadic wavelets (that are dilated by powers of 2), assuming n is a power of 2 will be convenient, but 
not necessary. As is standard in literature on streaming [25], [31], [32], we also assume that the numbers are 
polynomially bounded, i.e., all |/(i)|'s are in the range [n~ c ,n c ] for some constant c. 

B. Compactly Supported Wavelets 

We include here some definitions and notation that we use in the main text. Readers familiar with wavelets can 
easily skip this section. For thorough expositions on wavelets, we refer the interested reader to the authoritative 
texts by Daubechies [4] and Mallat [3], For a brief introduction to wavelets, see [33, Chp. 2.3]. 

A wavelet basis {ipk}k=i f° r ^-" i s a basis where each vector is constructed by dilating and translating a single 
function referred to as the mother wavelet ip. For example the Haar mother wavelet, due to Haar [34], is given by: 

1 if < t < 1/2 
1>H(t) = { -1 if 1/2 < i < 1 

otherwise 



The Haar basis for 1Z n is composed of the vectors if>j !S [i] = 2~i/ 2 ipn I *~ ^ s j where % <G [n], j — 1, . . . , logn, 
and s = 0, . . . , n/2 : > — 1, plus their orthonormal complement -y=\ n . This last basis vector is closely related to 
the Haar multiresolution scaling function 4>H(t) = lif0<£<l and otherwise. In fact, there is an explicit 
recipe for constructing the mother wavelet function ip from <f> using a conjugate mirror filter [35], [36] (see also 
Daubechies [3], and Mallat [4]). Notice that the Haar mother wavelet is compactly supported on the interval [0, 1). 
This wavelet, which was discovered in 1910, was the only known wavelet of compact support until Daubechies 
constructed a family of compactly-supported wavelet bases [37] in 1988 (see also [4, Chp. 6]). 

The vector ipj tS is said to be centered at 2-?s and of scale j and is defined on at most (2q— 1)2 J — 2(q— 1) points. 
For ease of notation, we will use both ipi and ipj iS depending on the context and assume there is a consistent map 
between them. 

The Cascade Algorithm for computing (f,ipj yS ), (/, (f>j tS ): Assume that we have the conjugate mirror filter h 
with support {0, . . . , 2q — 1}. Given a function / e lZ n , we set ao[i] = f[i], and repeatedly compute aj+i[i] = 
J2 S M s — 2t]oj[s] and dj+i[£] = ^ s j[s — 2i]aj[s] (where g[k] = (— \) k h[\ — k] is also a conjugate mirror filter). 
Notice that if the filter h has support {0, • • • , 2q — 1}, then we have < s — 2t < 2q — 1. This procedure gives 
a AA = (f,<t>j,t) and dj[t] = (f,tpj,t)- 

In order to compute the inverse transform, we evaluate aj[t] — J^ s h[t — 2s]aj+i[s] + ^2 s g[t — 2s]dj+i[s], 
Observe that by setting a single a,j[s] or dj[s] to 1 and the rest to 0, the inverse transform gives us (j>j >s or ipj i3 . 
Indeed, this is the algorithm usually used to compute <j>j, s and ipj iS . 

We will utilize the following proposition which is a consequence of the dyadic structure of compactly-supported 
wavelet bases. 

Proposition 1: A compactly-supported wavelet whose filter has 2q non-zero coefficients generates a basis for 
lZ n that has 0(<7logn) basis vectors with a non-zero value at any point i £ [ri\. 
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II. Greedy Approximation Algorithms for General Compact Systems and Data Streams 
Recall our optimization problem: Given a compactly-supported wavelet basis {ipi} and a target vector /, we 
wish to find {zi} with at most B non-zero numbers to minimize ||/ — Yli Zi^iWv 

We present two analyses below corresponding to t^ and £ p errors when p € [1, oo). In each case we begin by 
analyzing the sufficient conditions that guarantee the error. A (modified) greedy coefficient retention algorithm will 
naturally fall out of both analyses. The proof shows that several of the algorithms that are used in practice have 
bounded approximation guarantee. Note that the optimum solution can choose any values in the representation /. 
In what follows the pair (p,p') are the usual conjugates; i.e., - + \ = 1 when 1 < p < oo, and when p = 1 we 
simply set p' = oo. For simplicity, we start with the p = oo case. 

1) An loo Algorithm and Analysis: The main lemma, which gives us a lower bound on the optimal error, is: 
Lemma 2: Let £ be the minimum error under the t^ norm and {z*} be the optimal solution, then 

-Uih\£\<(f,^i)-zt<\m\i\£\ . 

Proof: For all j we have —\£\ < f(j) — 2~2i z t rfiU) — \£\- Since the equation is symmetric multiplying it 
by V'fc(i) we get, 

-\£\\Mj)\ <fU)MJ)-MJ)J2 z *^^ <\ £ WMj)\ 

i 

If we add the above equation for all j, since — \£\ J^ |^fe(i)l = — |f HlV'felli we obtain (consider only the left side) 

-\£ \\\Mi <E/w^w-E^wE<^w 

j j i 

i 

The upper bound follows analogously. ■ 

A Relaxation.: Consider the following program: 

minimize r 



(1) 



-rllV'illi <(f,i>i)-zi <r||Vi||i 

-rUV'nlll <{f,1pn)-Z„ < T||V'«||l 

At most B of the z/s are non-zero 

Observe that £ is a feasible solution for the above program and £ > t* where r* is the optimum value of the 
program. Also, Lemma [5] is not specific to wavelet bases, and indeed we have £ = t* when {^i} is the standard 
basis, i.e. ipt is the vector with 1 in the i th coordinate and elsewhere. The next lemma is straightforward. 
Lemma 3: The minimum r of program ([T} is the (B + l) th largest value |§y . 
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The Algorithm.: We choose the largest B coefficients based on |(/, V^I/llV^Hi- This can be done over a one 
pass stream, and in 0(B + log n) space for any compact wavelet basis. Note that we need not choose z% = (f,fpi) 
but any Zi such that \z.- L — (/, V'^l/IIV'illi < r*. But in particular, we may choose to retain coefficients and set 
z i = (fi'tpi)- The alternate choices may (and often will) be better. Also note that the above is only a necessary 
condition; we still need to analyze the guarantee provided by the algorithm. 

Lemma 4: For all basis vectors ipi of a compact system there exists a constant C s.t. ||V't||p||V'»llp' — v^C- 
Proof: Suppose first that p < 2. Consider a basis vector -0i[] = "0j,s[] of sufficiently large scale j that has 
converged to within a constant r (point-wise) of its continuous analog ipj, s () [3, pp. 264-5]. That is, IV'j.sM ~~ 
V^sCOl — r f° r a ^ & suc h ^at ipj, s [k] ^= 0. The continuous function ipj,s0 is given by i^j, s (t) — 2~i/ 2 ip(2~H — s), 
which implies ipj, s [k] = O (2~^ 2 ijj(2~^k — s)) = 0(2~ J / 2 ). Note that we are assuming Halloo itself is some 
constant since it is independent of n and B. Combining the above with the fact that ipj. s \\ has at most (2q)2 : > 
non-zero coefficients, we have ||^>|| P ' = 0{2-^ 2 ((2q)2^) 1 /p') = 0(2 j{ 7'^{2q)7), 

Now by Holder's inequality, ||V>.j>||f> < (( 2 9)2 J ) 5 ' _s ll'0j>l|2 = 2 j( p~ 3) (2g)p~ 5 . Therefore, for sufficiently 

-•(Ail j\ - + - - 

large scales j, ||V , i,s||p||V , i,s||p' = 0(2 p p' (2q) p p' 2 ) = 0(y/q), and the lemma holds. For basis vectors at 
smaller (constant) scales, since the number of non-zero entries is constant, the l p norm and the £ p > norm are both 
constant. 

Finally, for p > 2, the argument holds by symmetry. ■ 

Theorem 5: The 4» error of the final approximation is at most 0(q 3 / 2 log n) times £ for any compactly supported 
wavelet. 

Proof: Let {zi} be the solution of the system ([T]), and let the set of the inner products chosen be S. Let r* is the 
minimum solution of the system ([T]). The t^ error seen at a point j is | Y^ias(f> V'i^iC?)! — Z^tas K/i V'^IIV'iC?)!- 
By Lemma[3| this sum is at most X^s r *IIV'i||i|'0i(J)l> which is at most r* max^g HV'illillV'illoo times the number 
of vectors that are non-zero at j. By Proposition fTJ the number of non-zero vectors at j is O(qlogn). By LemmaE] 
HV'ilkllV'illoo < \/qC for all i, and since t* < £ we have that the i^ error is bounded by 0(q 3 ^ 2 logn)£. ■ 

2) An £ p Algorithm and Analysis for p £ [l,oo): Under the t p norm, a slight modification to the algorithm 
above also gives an 0(q 3 ^ 2 log n) approximation guarantee. 

Lemma 6: Let £ be the minimum error under the t v norm and {z*} be the optimal solution, then for some 
constant cq, 

(Ep^K/>^>-^n <(c qlognfp£ . 
Proof: An argument similar to that of Lemma [2] gives 

E|/^ fc «-E/i^«^«| = X)6i^(*)i<( E tf) HV'fciip' 

i i v*= support of ^jfe / 



IV> 






P iG support of i/jfc 
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where the last inequality follows from Proposition fl] that each i belongs to O(qlogn) basis vectors (co is the 
constant hidden by the this O-term). ■ 

A Relaxation.: Consider the following system of equations, 

minimize r 

IS n^ii^ J - (co9logn)pT (2) 

At most B of the Zi's are non-zero 

77ze Algorithm.: We choose the largest B coefficients based on |(/, V'fc)|/||'0fc||pS which minimizes the sys- 
tem (J2j. This computation can be done over a one pass stream, and in 0(B + logn) space. 

Theorem 7: Choosing the B coefficients (f,ipk) that are largest based on the ordering |(/, V'fc)l/IIV'fc||p' i s a 
streaming 0(q 3 ^ 2 logn) approximation algorithm for the unrestricted optimization problem under the £ p norm. 
Note this matches the l^ bounds, but stores a (possibly) different set of coefficients. 

Proof: Let the value of the minimum solution to the above system of equations |2) be r*. Since {z*} is 
feasible for system Q, t* < £. Assume S is the set of coefficients chosen, the resulting error £$ is, 



3 = £ 



J2(f^k)Tpk(i) 



k<?S 



< J2( c oqlognr- 1 J2\(f^k)\ p \M^\ P 

i k#S 



= (coqlogny-^lif^kWUkVp 

< (coqlogny- l J2T^r\(fM P 

k?s im "p' 

= C p qHr*Coqlogn)P . 

Here, the first inequality is Holder's inequality combined with Proposition fl] and the fact that p/p' = p — 1; the 
second inequality follows from Lemma HI and the final equality follows from the optimality of our choice of 
coefficients for the system (|2j. Now since r* < £, we have that Eg < coCq^Elogn. ■ 

3) Summary and a Tight Example: In the two preceeding subsections we showed the following: 

Theorem 8: Let - + ^ = 1, Choosing the largest B coefficients based on the ordering |(/, V'«)I/IIV'*IIp'> which is 
possible by a streaming 0(_B + log?i) algorithm, gives a 0(q? logn) approximation algorithm for the unrestricted 
optimization problem (ProblemfTh under the given t v norm. The argument naturally extends to multiple dimensions. 

As is well-known, this choice of coefficients is optimal when p = 2 (since p' = 2 and \\1ftiW2 = 1). 

Note that the above theorem bounds the gap between the restricted (where we can only choose wavelet coefficients 
of the input in the representation) and unrestricted optimizations. 

A tight example for the l^ measure.: Suppose we are given the Haar basis {ipi} and the vector / with the 
top coefficient (f,ipi) = and with (/, V'iVIIV'illi = 1 — e for i < n/2, and (/, ^>i)/||V>i||i = 1 for i > n/2 (where 
ipi, i > n/2, are the basis with smallest support). Let B = n/c — 1 where c > 2 is a constant that is a power of 2. 
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The optimal solution can choose the B coefficients which are in the top log n — log c levels resulting in an error 
bounded by log c. The d.^ error of the greedy strategy on the other hand will be at least log n — 1 because it will 
store coefficients only at the bottom of the tree. Hence it's error is at least log n/ log c — o(l) of the optimal. 

Ill, Applications of the Greedy Algorithm 

Our greedy algorithm extends to a variety of scenarios, which illustrate the scope and the applicability of the 
techniques presented above. 

A. A Universal Representation 

In this section we present a strategy that stores B(log n) 2 coefficients and simultaneously approximates the optimal 
representations for all p-norms. Notice that in Problem [T] we know the p-norm we are trying to approximate. Here, 
we do not know p and we wish to come up with a representation such that for all p £ [1. oo], its error measured with 
11/ — Jul I is O(logn) times the optimal error min z ||/ — £\ ^i/'illp where x has at most B non-zero components. 
Notice that we allow our universal representation to store a factor (logn) 2 more components than any one optimal 
representation; however, it has to approximate all of them concurrently. 

We run our algorithm as before computing the wavelet coefficients of the target vector /; however, we need to 
determine which coefficients to store for our universal representation. To this end, define the set: 

N = {p t :pt = l + - , t = 0,...,logn(togn-l)} . (3) 

logn 

For every p t E Af, we will store the B coefficients that are largest based on the ordering |(/, V , fc)l/IIV'fc||p' where p' t 

is the dual norm to p t . Hence, the number of coefficients we store is no more than B(logn) 2 since \M\ = (logn) 2 . 

Note that our dual programs show that for a given p, storing more than B coefficients does not increase the error 

of the representation. Now let f u be our resultant representation; i.e., if S contains the coefficients we chose, then 

f u = ^2 ie g{f,ilJi)il)i', and let ff , be the optimal representation under the norm £ p . Consider first the case when 

pe{p t , Pt+i) where p t ,Pt+i e N. 

||/-/«|| p < ||/-/u|| pt since p > p t 

< cqi (log n) ||/ - f* (pt) \\ pt by Thereom § 

< cq i (log n) 1 1 / - / ( * p) 1 1 pt by the optimality of f* t for £ Pt 

< eg 5 (log n)n p t p \\f — //*„) by Holder's inequality (4) 

However l/pt — 1/p < l/pt — 1/pt+i since p < pt+i, and by their definition, 

1 1 logn 1 
= < • 

p t pt+i (logn + t)(logn + t+ 1) logn 

Hence, n,pt~p < n 1 /^ 08 ™) = 2; and from expression Q we have that ||/ — / u || = 0(q^ logn)||/ — //* \|| as 
required. When p > p t fort — logn(logn — 1), we immediately have n«~ 5 < n 1 ^' 08 ") and the result follows. 
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B. Adaptive Quantization 

Wavelets are extensively used in the compression of images and audio signals. In these applications a small percent 
saving of space is considered important and attention is paid to the bits being stored. The techniques employed 
are heavily engineered and typically designed by some domain expert. The complexity is usually two-fold: First, 
the numbers Zi do not all cost the same to represent. In some strategies; e.g., strategies used for audio signals, the 
number of bits of precision to represent a coefficient Zi corresponding to the basis vector ipi — ipj tS is fixed, and 
it typically depends only on the scale j. (Recall that there is a mapping from i[>i to ipj,s-) Further the Oj[]'s are 
computed with a higher precision than the dj\\ 's, This affects the space needed by the top-most coefficients. In yet 
another strategy, which is standard to a broad compression literature, it is assumed that log 2 Z bits are required to 
represent a number z. All of these bit-counting techniques need to assume that the signal is bounded and there is 
some reference unit of precision. 

Second, in several systems, e.g., in JPEG2000 [38], a bitmap is used to indicate the non-zero entries. However the 
bitmap requires 0(n) space and it is often preferred that we store only the status of the non-zero values instead of 
the status of all values in the transform. In a setting where we are restricted to o(n) space, as in the streaming setting, 
the space efficiency of the map between non-zero coefficients and locations becomes important. For example, we 
can represent ipi = ipj.s using log log n + log(n/2 J ) +0(1) bits instead of log n bits to specify i. Supposing that 
only the vectors with support of s/n or larger are important for a particular signal, we will then end up using half 
the number of bits. Notice that this encoding method increases the number of bits required for storing a coefficient 
at a small scale j to more than log n. This increase is (hopefully) mitigated by savings at larger scales. Note also 
that the wavelet coefficients at the same level are treated similarly. 

The techniques we presented in Section |U] naturally extend to these variants of the bit-budget problem. In what 
follows, we consider three specific cases: 

1) Spectrum Representations: The cost c, of storing a coefficient corresponding to i is fixed. This case includes 
the suggested strategy of using log log n + log(n/2 J ) + 0(1) bits. 

2) Bit Complexity Representations: The cost of storing the ith coefficient with value Zi is a + b(Zi) for some 
(concave) function b(). A natural candidate for b() is b(z) — O(l) — logzf ac where zf ac is the fractional 
part of Zi and is less than 1 (thus — log zf ac is positive). This encodes the idea that we can store a higher 
"resolution" at a greater cost. 

3) Multiplane Representations: Here the data conceptually consists of several "planes", and the cost of storing the 
ith coefficient in one plane depends on whether the ith coefficient in another plane is retained. For example, 
suppose we are trying to represent a RGBA image which has four attributes per pixel. Instead of regarding 
the data as 4 x 2 dimensional, it may be more useful, for example if the variations in color are non-uniform, 
to treat the data as being composed of several separate planes, and to construct an optimization that allocates 
the bits across them. 

The fundamental method by which we obtain our approximate solutions to the above three problems is to use a 
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greedy rule to lower bound the errors of the optimal solutions using systems of constraints as we did in Section III] 
We focus only on the l^ error for ease of presentation. As before, the techniques we use imply analogous results 
for £ p norms. 

1) Spectrum Representations: In the case where the cost of storing a number for i is a fixed quantity Cj we 
obtain a lower bound via a quadratic program that is similar to ([TJ using Lemma [2] That is, minimize r with the 
constraints x. L e {0,1} and J^i x i°i < B, and for all i 

-rUih < (f,il>i) - XiZi < T\\if>i\\i (5) 

The program above can be solved optimally since the c/s are polynomially bounded. We sort the coefficients in non- 
increasing order of yi := |(/, V , *)l/||V'i||i i ^f Ui 1 > Vi 2 > ■ ■ ■ > Vi„> then we include coefficients ii,...,ik where 
Xli=i c ij — B < S j=i c *j- The value 2/i fc+1 is then a lower bound on the error £ of the optimal representation z*. 
Note that z* is a feasible solution to program (|5J. Hence, either z* includes coefficients i 1; . . . , «*. in which case 
it cannot choose coefficient ifc+i for it will exceed the space bound B, and we have that £ > yt k+1 (the optimal 
does not necessarily set z% — \{f,ipi)\)', or, z* does not include one of ii, ■■.,%< thus £ is again greater then or 
equal to yi fc+1 . A proof similar to that of Theorem [5] shows that the error of our solution is 0(logn)£ . 

2) Bit Complexity Representations: In the case where the cost is dependent on Z{ we cannot write an explicit 
system of equations as we did in the case of spectrum representations. However, we can guess r up to a factor of 
2 and verify if the guess is correct. 

In order to verify the guess, we need to be able to solve equations of the form min z b(z) s.t. \a — z\ < t 
(since this is the format of our constraints). This minimization is solvable for most reasonable cost models; e.g., 
if b(z) is monotonically increasing. As the coefficients are generated, we compute c, + b(zi) if z% ^ 0, where 
Zi = argmin z 6(^) s.t. \(f,il>i) — z\ < t\\ipi\\i for our guess t of the error. If we exceed the alloted space B at any 
point during the computation, we know that our guess t is too small, and we start the execution over with the guess 
2t. Note that the optimal representation is a feasible solution with value £ and bit complexity B. Applying the 



analysis of Section [Tl-. 1 1 shows that the first solution we obtain that respects our guess is a O(logn) approximation 
to the optimal representation. 

Since we assume that the error £ is polynomially bounded, the above strategy can be made to stream by running 
O(logn) greedy algorithms in parallel each with a different guess of r as above. 

3) Multiplane Representations: In this case we are seeking to represent data that is conceptually in several 
"planes" simultaneously; e.g., RGBA in images. We could also conceptualize images of the same object at various 
frequencies or technologies. The goal of the optimization is to allocate the bits across them. However, notice that 
if we choose the ith coefficient for say the Red and the Blue planes (assuming that we are indicating the presence 
or absence of a coefficient explicitly which is the case for a sparse representation), then we can save space by 
storing the fact that "coefficient i is chosen" only once. This is easily achieved by keeping a vector of four bits 
corresponding to each chosen coefficient. The values of the entries in the bit vector inform us if the respective 
coefficient value is present. Therefore, the bit vector 1010 would indicate that the next two values in the data 
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correspond to Red and Blue values of a chosen coefficient. Similarly, a vector 1011 would suggest that three values 
corresponding to Red, Blue and Alpha are to be expected. 

In what follows, we assume that the data is D dimensional and it is comprised of t planes (in the RGBA example 
D = 2 and t = 4). We are constrained to storing at most B bits total for the bit vectors, the indices of the chosen 
coefficients, and the values of these coefficients. For simplicity we assume that we are using the t^ error across 
all the planes. Otherwise, we would also have to consider how the errors across the different planes are combined. 

We construct our approximate solution by first sorting the coefficients of the t planes in a single non-increasing 
order while keeping track of the plane to which each coefficient belongs. As before, we add the coefficients that 
are largest in this ordering to our solution, and stop immediately before the coefficient whose addition results in 
exceeding the alloted space B. Note that if we had added the ith coefficient of the Red plane first, and thereafter 
wanted to include the Blue plane's ith coefficient, then we need only account for the space of storing the index i 
and the associated bit vector when we add the coefficient for the first (in this case Red) plane. The subsequent ith 
coefficients only contribute to the cost of storing their values to the solution. (We can think of the cost of storing 
each coefficient as fixed after the ordering of the coefficients is determined.) This strategy is reminiscent of the 
strategy used by Guha, Kim and Shim [39] to lower bound the optimum error for a similar problem in the ti 
setting. 

The first coefficient we did not choose using this greedy selection process is a lower bound on the optimal 
representation error. Now, an argument similar to that of Theorem [5] shows that the error of the resulting solution 
is a 0(log n) factor away from the error of the optimal solution. 

C. Sparse Image Representation under non-1.2 Error Measures 

In this section we give three examples that demonstrate uses for our greedy algorithm in compressing images. 
A non-streaming version of the algorithm for Haar and Daubechies wavelets was implemented in Matlab using 
the UviJWave.300 toolbo^ [40] . Pseudocode of the implementation is provided below in Figure [T] The algorithm 
takes four parameters as input: the image X, the number of coefficients to retain B, the p-norm to minimize, and 
the type of Daubechies wavelet to use. The last parameter, q, determines the number of non-zero coefficients in the 
wavelet filter. Recall that the Haar wavelet is the Daubechies wavelet with smallest support; i.e., it has q = 1. 

The first example illustrates a use of the ioo measure for sparse representation using wavelets. Minimizing the 
maximum error at any point in the reconstructed image implies we should retain the wavelet coefficients that 
correspond to sharp changes in intensity; i.e., the coefficients that correspond to the "details" in the image. The 



image we used, shown in Figure 2(a) is composed of a gradient background and both Japanese and English 



texts' The number of non-zero wavelet coefficients in the original image is 65524. We set B = 3840 and ran 



Algorithm DaubGreedy with p = 1,2 and oo under the Haar wavelet (with q = 1). When p — 2, the algorithm 



^For compatibility with our version of MATLAB, slight modifications on the toolbox were performed. The toolbox can be obtained from 



http: //www.gts .tsc.uvigo.es/~wavelets/ 



" The Japanese text is poem number 89 of the Kokinshu anthology [41]. The translation is by Helen Craig McCullough. 
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Algorithm DaubGreedy(X , B,p, q) 

1. (* X is a grayscale image (intensity matrix) *) 

2. Perform a 2D wavelet transform of X using the Daubechies D q wavelet 

3. Let w be the wavelet coefficients of the transform 

4. p> <- p/(p - 1) 

5. y 4 *- l^il/llV'tllp' 

6. Let X be the indices of the 5 largest t/j's 

7. wj <- if i £ 1 

8. Perform a 2D inverse wavelet transform on the resulting w 

9. Let X' be the resulting image representation 

10. return X' 



Fig. 1 : Pseudocode of the greedy algorithm's implementation. 



outputs the optimal _B-term representation that minimizes the £ 2 error measure. That is, the algorithm simply retains 
the largest B wavelet coefficients (since p' — 2 and HV^IIp' = 1 for all i). When p = 1, or p = oo, the algorithm 
outputs a O (log n) -approximate S-term representation as will be explained in Section In] The results are shown in 
Figure [2] Notice that the ^ representation essentially ignores the gradient in the background, and it retains the 
wavelet coefficients that correspond to the text in the image. The t\ representation also does better than the £ 2 
representation in terms of rendering the Japanese text; however, the English translation in the former is not as clear. 
The attribution in the £ 2 representation, on the other hand, is completely lost. Although the differences between the 
three representations are not stark, this example shows that under such high compression ratios using the £ QO norm 
is more suitable for capturing signal details than other norms. 

The second example illustrates a use of the t\ error measure. Since the t\ norm is robust in the sense that it is 
indifferent to outliers, the allocation of wavelet coefficients when minimizing the l\ norm will be less sensitive to 
large changes in intensity than the allocation under the £ 2 norm. In other words, it implies that under the l\ norm 



the wavelet coefficients will be allocated more evenly across the image. The image we used, shown in Figure 3(a) 



is a framed black and white matte photograph. The number of non-zero wavelet coefficients in the original image 



is 65536. We set B = 4096 and ran Algorithm DaubGreedy with p = 1,2 and oo under the Daubechies D 2 
wavelet. The results are shown in Figure [3] Notice that the face of the subject is rendered in the £\ representation 
more "smoothly" than in the £ 2 representation. Further, the subject's mouth is not portrayed completely in the l 2 
representation. As explained earlier, these differences between the two representations are due to the fact that the 
t\ norm is not as affected as the t 2 norm by other conspicuous details in the image; e.g., the frame. The i^o 
representation, on the other hand, focuses on the details of the image displaying parts of the frame and the eyes 
well, but misses the rest of the subject entirely. This example foregrounds some advantages of the t\ norm over 
the customary £ 2 norm for compressing images. 
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Trans. H.C. McCi 



In the lingering wake 
of the breeze that has scattered 
the cherry tree's bloom, 
petal wavelets go dancing 
across the waterless sky 




in the lingering wake 
of the breeze that has scattered 
the cherry tree's bloom, 
petal wavelets go dancing 
across the waterless sky 



(a) The original image 



In the lingering wake 
of the breeze that has scattered 
the cherry tree's bloom, 
petal wavelets go dancing 
across the waterless sky 



(b) Output of the optimal 
B wavelet coefficients) 



algorithm (which retains the largest 




tl1he]in'jennrj w$k^ 
ol the breeze lfjai Jias soarBre-i 
Hie cherry "tree's bloom, 
petal wavelet go dancing 
across "the waterless sky 



(c) Output of our greedy algorithm under i 



(d) Output of our greedy algorithm under 



Fig. 2: Representing an image with embedded text using the optimal strategy that minimizes the £2 error, and our greedy 
approximation algorithm under the l^ and £ 1 error measures. The Haar wavelet is used in all three representations, and the 
number of retained coefficients is B — 3840. 
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(a) The original image 



(b) Output of the optimal 
B wavelet coefficients) 



algorithm (which retains the largest 




(c) Output of our greedy algorithm under i 



(d) Output of our greedy algorithm under 



Fig. 3: Representing an image using the optimal strategy that minimizes the £2 error, and our greedy approximation algorithm 
under the l^ and t\ error measures. The Daubechies D2 wavelet is used in all three representations, and the number of retained 
coefficients is B — 4096. 
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(a) The original image 



(b) Output of the optimal £2 algorithm (which retains the largest 
B wavelet coefficients) 




(c) Output of the best rank-12 approximation 



(d) Output of our greedy algorithm under 



Fig. 4: Representing an image using the optimal strategy that minimizes the £2 error and using our greedy approximation 
algorithm under the £\ error measure versus its best rank-fc approximation. Here k — 12, and the number of values stored in 
all three representations is 6144. The Haar wavelet is used in the two nonlinear representations (the number of retained wavelet 
coefficients is B = 3072). 
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The last example highlights the advantage of representing an image sparsely using a nonlinear wavelet ap- 
proximation versus using a rank-fc approximation of the image. Recall that if X is our image then the best rank-Zc 
approximation is given by Uk^kV^ where X — UY.V T is the SVD decomposition of X, and Uk is comprised of the 
k singular vectors corresponding to the largest k singular values of X (see, e.g., [42]). The original image is shown in 



Figure 4(a~ and the number of non-zero coefficients in its Haar wavelet expansion is 65536. Figure 4(c) shows the 
best rank-12 approximation of the image; i.e., it displays X12 = U 12^12^x2- This representation stores 6144 values 



corresponding to the number of elements in C/12S12 plus V\2. We set B — 3072 and ran Algorithm DaubGreedy 



with p = 1, 2 under the Haar wavelet (Figures 4(d) and 4(b) I. (The B-term representation problem implicitly 
requires storing IB numbers: the B values of the solution components that we compute, and the B indices of 
these components.) It is clear that the nonlinear approximations offer perceptually better representations that the 
approximation offered by the SVD. Also, as in the previous example, the l\ representation is again "smoother" 
than the £2 with less visible artifacts. 

IV. A Streaming (1 + e) Approximation for Haar Wavelets 

In this section we will provide a FPTAS for the Haar system. The algorithm will be bottom up, which is convenient 
from a streaming point of view. Observe that in case of general £ p norm error, we cannot disprove that the optimum 
solution cannot have an irrational value, which is detrimental from a computational point of view. In a sense we 
will seek to narrow down our search space, but we will need to preserve near optimality. We will show that there 
exists sets Ri such that if the solution coefficient Zi was drawn from Ri, then there exists one solution which is 
close to the optimum unrestricted solution (where we search over all reals). In a sense the sets Ri "rescue" us from 
the search. Alternately we can view those sets as a "rounding" of the optimal solution. Obviously such sets exist if 
we did not care about the error, e.g. take the all zero solution. We would expect a dependence between the sets Ri 
and the error bound we seek. We will use a type of "dual" wavelet bases; i.e., where we use one basis to construct 
the coefficients and another to reconstruct the function. Our bases will differ by scaling factors. We will solve 
the problem in the scaled bases and translate the solution to the original basis. This overall approach is similar to 
that in [43], however, it is different in several details critical to the proofs of running time, space complexity and 
approximation guarantee. 

Definition 1: Define t/>", s = 2~ j/2 V>j> and ip b js = V^^^s- Likewise define <% s = 2-^ 2 <f> jtS . 

Proposition 9: The Cascade algorithm used with -4= h\\ computes (f,ipi) and (f,4>f). 
We now use the change of basis. The next proposition is clear from the definition of {ip\}- 

Proposition 10: The problem of finding a representation / with {zi} and basis {ipi} is equivalent to finding the 
same representation / using the coefficients {y{\ and the basis {ipi}. The correspondence is y^ = y^ s = 2~^' 2 Zj tS - 

Lemma 11: Let {y*} be the optimal solution using the basis set {4>\} for the reconstruction, i.e., / = ^ Viipi 
and ||/ — /||p = £. Let {yf } be the set where each y* is rounded to the nearest multiple of p. If f p = V^ i y^\ 
then \\f - fP\\ p < £ + Oiqn 1 /? plogn). 

"The image is taken from a water painting by Shozo Matsuhashi. It is untitled. 
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Proof: Let pi = y* — y?. By the triangle inequality, 

ii/-/%<£+iiE^ii*> • 

Proposition 111 and the fact that \pi\ < p imply |^ fe piij}\{k)\ < cpqlognm.axi\ipi(k)\ for a small constant c. This 
bound gives ||/ — f p \\ p < £ + 0{qn 1 / p p log nmax^ Halloo)- Now i\}\ = ipj s — 2^/ 2 ipj S , and from the proof of 
Lemma 4 we know that for large j, ||V'j,s||oo is at most 2~^ 2 times a constant. For smaller j, ||V^J|oo is a constant. 

■ 

We will provide a dynamic programming formulation using the new basis. But we still need to show two results; 
the first concerning the y*'s and the second concerning the OjU's. The next lemma is very similar to Lemma [2] and 
follows from the fact that ||^£j|i = 2-^ 2 \\ip :j . s \\ 1 < y/2q. 

Lemma 12: —Co^/q£ < (/, ipf) — y* < Co^/qE for some constant Co. 
Now suppose we know the optimal solution /, and suppose we are computing the coefficients dj\\ and dj[] for 
both / and / at each step j of the Cascade algorithm. We wish to know by how much their coefficients differ since 
bounding this gap would shed more light on the solution /. 

Proposition 13: Let a,j[s](F) be a,j[s] computed from ao[s] — F(s) then a,j[s](f) — a,j[s](f) = a,j[s](f — /). 

Lemma 14: If ||/ — f\\ p < £ then |aj[s](/ — /)| < C\^fq£ for some constant C\, (We are using i/j[].) 
Proof: The proof is similar to that of Lemma Uj Let F = f — f. We know — £ < F(i) < £. Multiplying 
by \4>j tS (i)\ and summing over all i we get — £||</>" -s ||i < (F,(pj s ) = aj[s](F) < £\\<j>j S \\i. By definition, (f)°- s = 
2~i/ 2 (j)j S . Further, ||^>j, s ||2 = 1 an d nas at most {2q)2 : > non-zero values. Hence, ||0? s ||i < \/Zq- The lemma 
follows. ■ 

At this point we have all the pieces. Summarizing: 

Lemma 15: Let {zi} be a solution with B non-zero coefficients and with representation / = ^ z%i\)i. If ||/ — 
f\\ P < £, then there is a solution {yi} with B non-zero coefficients and representation /' = Z\^iVi' l P\ such that for 
all i we have, 

(i) yi is a multiple of p; 
(ii) \ yi -{f,W)\<C 0y /q£ + p;m&, 
(iii) \(f,tf) - (f',<f>i)\ < d^q£ + O(qplogn), 

and ||/ - /'||p <£ + 0{qn r / p p\ogn). 

Proof: Rewrite / = J^ i Ziipi = X)i z t' < Pi wnere z t — z 1 s — ^^ 2z j,s- Let {y{\ be the solution where each 



yi equals z* rounded to the nearest multiple of p. Lemmas 12 andPBIbound the z*'s thus providing properties (ii) 



and (iii). Finally, Lemma 11 gives the approximation guarantee of {yi}. ■ 

The above lemma ensures the existence of a solution {j/,} that is 0(qn 1 ^ p plogn) away from the optimal solution 
and that possesses some useful properties which we shall exploit for designing our algorithms. Each coefficient yi 
in this solution is a multiple of a parameter p that we are free to choose, and it is a constant multiple of £ away 
from the i th wavelet coefficient of /. Further, without knowing the values of those coefficients yj_ s contributing 
to the reconstruction of a certain point f'(i), we are guaranteed that during the incremental reconstruction of 
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f'(i) using the cascade algorithm, every Oj[s](/') in the support of f'(i) is a constant multiple of £ away from 
a,j[s)(f) = (/, 0? s ). This last property allows us to design our algorithms in a bottom-up fashion making them 
suitable for data streams. Finally, since we may choose p, setting it appropriately results in true factor approximation 
algorithms. Details of our algorithms follow. 

A. The Algorithm: A Simple Version 

We will assume here that we know the optimal error £. This assumption can be circumvented by running 0(log n) 
instances of the algorithm presented below 'in parallel', each with a different guess of the error. This will increase 



the time and space requirements of the algorithm by a O(logn) factor, which is accounted for in Theorem 16 (and 



also in Theorem 18 1. We detail the guessing procedure in Section IV- A. 1 Our algorithm will be given £ and the 
desired approximation parameter e as inputs (see Fig. |6j. 

The Haar wavelet basis naturally form a complete binary tree, termed the coefficient tree, since their support 
sets are nested and are of size powers of 2 (with one additional node as a parent of the tree). The data elements 
correspond to the leaves, and the coefficients correspond to the non-leaf nodes of the tree. Assigning a value y to 
the coefficient corresponds to assigning +y to all the leaves that are left descendants (descendants of the left child) 
and — y to all right descendants (recall the definition of {V'f })■ The leaves that are descendants of a node in the 
coefficient tree are termed the support of the coefficient. 

Definition 2: Let E[i,v,b] be the minimum possible contribution to the overall error from all descendants of 
node i using exactly b coefficients, under the assumption that ancestor coefficients of i will add up to the value v 
at i (taking account of the signs) in the final solution. 

The value v will be set later for a subtree as more data arrive. Note that the definition is bottom up and after we 
compute the table, we do not need to remember the data items in the subtree. As the reader would have guessed, 
this second property will be significant for streaming. 

The overall answer is min?, E[root, 0, b] — by the time we are at the root, we have looked at all the data and no 
ancestors exist to set a non-zero v. A natural dynamic program arises whose idea is as follows: Let ii, and ir be 
node i's left and right children respectively. In order to compute E[i,v,b], we guess the coefficient of node i and 
minimize over the error produced by i^ and ir that results from our choice. Specifically, the computation is: 

1) A non-root node computes E[i,v,b] as follows: 

{min ri f/ E[i L ,v + r, b'] + E[i R , v — r, b — b' — 1} 
min b / E[i L , v, b'} + E[i R , v,b- b'} 
where the upper term computes the error if the i th coefficient is chosen and it's value is r € Ri where Ri is 
the set of multiples of p between (/, ip?) — Coy/q£ and (/, tpf) + Coy/q£; and the lower term computes the 
error if the i th coefficient is not chosen. 

2) Then the root node computes: 

min rj {/ E[ic, r, V — 1] root coefficient is r 
nun \ 

miri;,/ E[ic, 0, b'] root not chosen 
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xl_ x2 x3 x4 xl x2_ x3 x4 

(a) The arrival of the first 3 elements. 





xl x2 x3 x4 



xl x2 x3 x4 
(b) The arrival of X4 



Fig. 5: Upon seeing X2 node 1 computes E[l, ■, ■] and the two error arrays associated with x\ and X2 are discarded. Element £4 
triggers the computation of E[2, ■, ■] and the two error arrays associated with X3 and X4 are discarded. Subsequently, E[3, ■, ■] 
is computed from E[l, ■, ■] and E[2, ■, ■] and both the latter arrays are discarded. If X4 is the last element on the stream, the 
root's error array, E[3, ■, •], is computed from E[2, ■, ■]. 



where ic is the root's only child. 

The streaming algorithm will borrow from the paradigm of reduce-merge. The high level idea will be to construct 
and maintain a small table of possibilities for each resolution of the data. On seeing each item f(i), we will first find 
out the best choices of the wavelets of length one (over all future inputs) and then, if appropriate, construct/update 
a table for wavelets of length 2,4,... etc. 

The idea of subdividing the data, computing some information and merging results from adjacent divisions 
were used in [27] for stream clustering. The stream computation of wavelets in [24] can be viewed as a similar 
idea — where the divisions corresponds to the support of the wavelet basis vectors. 

Our streaming algorithm will compute the error arrays E[i, ■, ■] associated with the internal nodes of the coefficient 
tree in a post-order fashion. Recall that the wavelet basis vectors, which are described in Section II] form a complete 
binary tree. For example, the scaled basis vectors for nodes 4,3, 1 and 2 in the tree of Fig. |5(a)| are [1, 1, 1, 1], 
[1, 1, —1, —1], [1, —1, 0, 0] and [0, 0, 1, —1] respectively. The data elements correspond to the leaves of the tree and 
the coefficients of the synopsis correspond to its internal nodes. 

We need not store the error array for every internal node since, in order to compute E[i,v,b] our algorithm 
only requires that E[il, •, •] and E[ifj, ■, ■] be known. Therefore, it is natural to perform the computation of the 
error arrays in a post-order fashion. An example best illustrates the procedure. Suppose / = (xi, x%, X3, X4). In 



Fig. 5(a) when element X\ arrives, the algorithm computes the error array associated with X\, call it E Xl . When 
element x-i arrives E X2 is computed. The array i?[l,-,-] is then computed and E Xl and E X2 are discarded. Array 
E X3 is computed when x% arrives. Finally the arrival of X4 triggers the computations of the rest of the arrays as in 



Fig. 5(b) Note that at any point in time, there is only one error array stored at each level of the tree. In fact, the 
computation of the error arrays resembles a binary counter. We start with an empty queue Q of error arrays. When 
Xi arrives, E qo is added to Q and the error associated with X\ is stored in it. When x^ arrives, a temporary node is 
created to store the error array associated with x<z. It is immediately used to compute an error array that is added 
to Q as E qi . Node E qo is emptied, and it is filled again upon the arrival of x 3 . When a; 4 arrives: (1) a temporary 
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E tl is created to store the error associated with x^\ (2) E tl and E qg are used to create E t2 ; E tl is discarded and 
E qo is emptied; (3) E t2 and E qi are used to create E q2 which in turn is added to the queue; E t2 is discarded and 
E qi is emptied. The algorithm for C^ is shown in Fig. [6] 
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Algorithm HaarPTAS(B, S, e) 

1. Let p = e£/(cqlogn) for some suitably large constant c. Note that q = 1 in the Haar case. 

2. Initialize a queue Q with one node q 

(* Each qi contains an array E q . of size at most Rmin{B,2 1 } and a flag isEmpty *) 

3. repeat Until there are no elements in the stream 

4. Get the next element from the stream, call it e 

5. if go is empty 

6. then Set qo.a = e. For all values r s.t. \r — e\ < c\E where c\ is a large enough constant and r is a multiple 

of p, initialize the table E qo [r, 0] = \r — e| 

7. else Create ii and Initialize E tl [r, 0] = |r — e| a.s /n 5fep[6] 

8. for i = 1 until the 1 st empty % or end of Q 

9. do Create a temporary node t%. 

10. Compute t2-a = (f,<l>i) and the wavelet coefficient t^.o = (f,ipf). This involves using the 
a values of t\ and g^_i (t 2 's two children in the coefficient tree) and taking their average to 
compute ti.u and their difference divided by 2 to compute t^-o. (Recall that we are using 

11. For all values r that are multiples of p with \r — t 2 .a\ < C\{£ + plogn), compute the table 
E t2 [r,b] for all < b < B. This uses the tables of the two children t\ and qi-\. The size of 
the table is 0{e^ x Bn 1 / p \ogn). (Note that the value of a chosen coefficient at node t% is at 
most a constant multiple of £ away from ^-o. Keeping track of the chosen coefficients (the 
answer) costs O(B) factor space more.) 

12. Set t\ <— £2 and Discard t 2 

13. Set gi.isEmtpy = true 

14. if we reached the end of Q 

15. then Create the node qi 

16. Compute E qi [r, b £ B] from t\ and q.^i as in Step 11 

17. Set ^.isEmpty = false and Discard t\ 

Fig. 6: The Haar streaming FPTAS for l x . 
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1) Guessing the Optimal Error: We have so far assumed that we know the optimal error £. As mentioned at 



the beginning of Section IV-A we will avoid this assumption by running multiple instances of our algorithm and 



supplying each instance a different guess Gk of the error. We will also provide every instance Ak of the algorithm 
with e' = \ e ~ 1 as the approximation parameter. The reason for this will be apparent shortly. Our final answer 
will be that of the instance with the minimum representation error. 



Theorem 16 shows that the running time and space requirements of our algorithm do not depend on the supplied 



error parameter. However, the algorithm's search ranges do depend on the given error. Hence, as long as Gk > £ 



the ranges searched by the k th instance will include the ranges specified by Lemma 15 Lemma 15 also tells us 



that if we search these ranges in multiples of p, then we will find a solution whose approximation guarantee is 
£ + cqn 1 / p plog n. Our algorithm chooses p so that its running time does not depend on the supplied error parameter. 
Hence, given Gk and e', algorithm Ak sets p = e' 'Gk / '(cqn 1 ^ 'log n) . Consequently, its approximation guarantee is 
£ + e'G k . 

Now if guess Gk is much larger than the optimal error £, then instance Ak will not provide a good approximation 
of the optimal representation. However, if Gk < (1 + e')£, then Ak's guarantee will be £ + e'(l + e')£ = (1 + e)£ 
because of our choice of e'. To summarize, in order to obtain the desired (1 + e) approximation, we simply need 
to ensure that one of our guesses (call it Gk*) satisfies 

£ < G k * < (l + e')£ 

Setting Gk = (1 + e') k , the above bounds will be satisfied when k = k* e [log 1+e , (£) , log 1+e /(£) + 1]. 

Number of guesses: Note that the optimal error £ = if and only if / has at most B non-zero expansion 
coefficients (f,ipi). We can find these coefficients easily in a streaming fashion. 

Since we assume that the entries in the given / are polynomially bounded, by the system of equations ([TJ we 
know that the optimum error is at least as much as the (B + l) st largest coefficient. Now any coefficient ((/, V'fc)) 
is the sum of the left half minus the sum of the right half of the //s that are in the support of the basis and the 
total is divided by the length of the support. Thus if the smallest non-zero number in the input is n~ c then the 
smallest non-zero wavelet coefficient is at least n~( c+1 '. By the same logic the largest non-zero coefficient is n c . 
Hence, it suffices to make O(logn) guesses. 

B. Analysis of the Simple Algorithm 

The size of the error table at node i, E[i, •, •], is i^minli?^^} where R^ — 2C\£ / p+logn and ti is the height 
of node i in the Haar coefficient tree (the leaves have height 0). Note that q = 1 in the Haar case. Computing each 
entry of E[i, ■, ■] takes 0{R^ min{i?, 2 li }) time where R^ = 2Cq£ / p+ 2. Hence, letting R = max{i?0, R^}, the 
total running time is 0(R 2 B 2 ) for computing the root table plus 0(X^ILi (i?min{2' i ,i?}) ) for computing all the 
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other error tables. Now, 



n logn 

^(i?min{2 t %S}) 2 = R 2 J2- t mm{2 2t ,B 2 } 




= 0(R 2 nB) , 

where the first equality follows from the fact that the number of nodes at level t is ||. For t^, when computing 
E[i,v,b] we do not need to range over all values of B. For a specific r G Ri, we can find the value of b 1 that 
minimizes msx.{E[i l , v + r, b'],E[in, v — r, b — b' — 1]} using binary search. The running time thus becomes, 

n 



^2R 2 —mm{t2 t ,BlogB} = 0(nR 2 log 2 B) 



L 



The bottom up dynamic programming will require us to store the error tables along at most two leaf to root paths. 
Thus the required space is, 

2^ i?min{2*, B} = 0(RB(l + log ^)) . 

t 

Since we set p — e£/(cn 1 / p logn), we have R = (9((n 1 / p logn)/e). 

Theorem 16: Algorithm HaarPTAS is a 0(e~ 1 B 2 n 1 / p log n) space algorithm that computes a (1 + e) approx- 



imation to the best i?-term unrestricted representation of a signal in the Haar system. Under the £ p norm, the 
algorithm runs in time 0{e~ 2 n 1+2 / p B log 3 n). Under l^ the running time becomes 0(e~ 2 7ilog 2 -Blog 3 n). 

The extra B factor in the space required by the algorithm accounts for keeping track of the chosen coefficients. 
C. An Improved Algorithm and Analysis 



For large n (compared to B), we gain in running time if we change the rounding scheme given by Lemma 1 1 
The granularity at which we search for the value of a coefficient will be fine if the coefficient lies toward the top 
of the tree, and it will be coarse if the coefficient lies toward the bottom. The idea is that, for small £ p norms, a 
mistake in a coefficient high in the tree affects everyone, whereas mistakes at the bottom are more localized. This 



idea utilizes the strong locality property of the Haar basis. We start with the lemma analogous to Lemma 1 1 

Lemma 17: Let {y*}, i = (ti,s) be the optimal solution using the basis set {ip^} for the reconstruction, i.e., 
/ = J2i Vti't ar, d 11/ ~ /lip = £- Here ti is the height of node i in the Haar coefficient tree. Let {y P } be the set 
where each y* is first rounded to the nearest multiple of p t . = e£/(2B2 ti /' p ) then the resulting value is rounded to 
the nearest multiple of p tiM = e£/{2Bn 1 / p ). If fP = £\ y?ip\ then ||/ - f% < (1 + e)£. 
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Proof: As in Lemma [TT[ we need to estimate ||X^(yf — ViWiWp but usm g me new rounding scheme. Let S 
be the set of indices i such that y\ ^ 0. 



< J2.APu+Pt M )Ul\\ P 

- L^n^s ' * 
The last inequality follows from the fact that 2 fi components of ^\ are equal to one and the rest are zero. The 
approximation hence follows from \S\ < B and our choices of p ti . ■ 

The granularity of the dynamic programming tables E[i, ■, •] is set according to the smallest p ti which is p tlM — 
eE /{2Bn 1 l p ). This allows their values to align correctly. More specifically, when a coefficient is not chosen we 
compute (see Section IV-A|i 



E[i, v, b] — ra.m.E[iL,v, b'] + E[in, v, b — b'] . 

b' 

A value v will that is not outside the range of E[iL, ■, ■] and E[ir, •, •] will be a correct index into these two arrays. 
We gain from this rounding scheme, however, when we are searching for a value to assign to node i. If i is chosen, 
we can search for its value in the range (f,ipf) ± 2Cq£/p in multiples of p ti . Hence, as mentioned earlier, the 
granularity of our search will be fine for nodes at top levels and coarse for nodes at lower levels. More formally, 
if i is chosen, we compute 

E[i, v, b) — mini?[iL,w + r, b'] + E[in, v — r, b — b 1 — 1] , 

r.b' 

where we search for the best r in multiples of p ti . The value v + r (resp. v — r) may not index correctly into 
E[il, •) •] (resp. E[in, •, •]) since p ti — 2 d / p p trM where d = t root — U. Hence, we need to round each value of r we 



wish to check to the nearest multiple of p tlM . This extra rounding is accounted for in Lemma 17 



Letting R be the number of values each table holds and R ti — 2Co£/pt i + 2 be the number of entries we 



search at node i, and using an analysis similar to that of Section IV-B the running time (ignoring constant factors) 
becomes, 



™ l0g " -n R9*/P 

0£>i? tl min{2 2 \B 2 }) = 0(R J] ^ — min{2 2t , S 2 }) 




2 t/ P -t ^ 



[ b i+i/p\ 



c 

Hence, since R = 0(n 1 ' p B/e) based on the granularity pt ml , the running time for each instance of the algorithm 
is 0((nB) 1+1 / p B 2 /e 2 ). The space requirement is the same as that of the simpler algorithm; namely, O(RBlogn). 
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Theorem 18: The above algorithm (with the new rounding scheme) is a 0(e~ 1 B 3 n 1 / p log 2 n) space algorithm 
that computes a (1 + e) approximation to the best B-term unrestricted representation of a signal in the Haar system 
under the £ p norm. The algorithm runs in time 0{e~ 2 {nB) 1+1 / p B 2 logn). 



Again, and as in Theorem 16 the extra B factor in the space requirement accounts for keeping track of the 
chosen coefficients, and the extra logn factor in both the space and time requirements accounts for the guessing 
of the error. 

We choose the better of the two algorithms (or rounding schemes) whose approximation and time and space 



requirements are guaranteed by Theorems 16 and 18 



V. Extensions 
A. PTAS for multi- dimensional Haar Systems 



Our algorithm and analysis from Section IV extend to multi-dimensional Haar wavelets when the dimension D 
is a given constant. For D > 2 define 2 D — 1 mother wavelets (see also [12], [18]).. For all integers < d < 2 
let 

^ d (x) = 8 (h (x 1 )9 d >(x 2 )---6 d °(x D ) , 

where did 2 ■ ■ ■ djj is the binary representation of d and 6° = </>, 9 1 = ift. For d=0we obtain the D-dimensional 
scaling function -0°(a;) = 4>(xi)(f)(x 2 ) ■ ■ ■ <fi(xrj). At scale 2 J and for s — (s\, S2, ■ ■ ■ , sd) define 



^(-) = 2-^v(^i,---,^^) 



The family {tp d s }i<d<2 D .(j,n)ei D+1 is an orthonormal basis of L 2 (M. D ) [3, Thm. 7.25]. Note that in multi- 
dimensions we define i/>"' = 2~ D H 2 ib d , tb.\ — 2 D ^ 2 ^b d a and 6 a -\ = 2~ D: >/ 2 'ib® . which is analogous to 
Definition[l] Thus ||V^;/||i = ll^.flli = 1 since ||V^J|i = 2 zy 2-- Ey / 2 = 2 D ^ 2 . Also ||^||oo = 1. Each node 
in the coefficient tree has 2 D children and corresponds to 2 D — 1 coefficients (assuming the input is a hypercube). 
The structure of the coefficient tree will result in a 0(R 2 _1 ) increase in running time over the one-dimensional 
case where R = 0{e~ 1 n 1 / p \ogri). 



As in Section IV-A we associate an error array E[i, b, v] with each node i in the tree where v is the result of 
the choices of i's ancestors and b < B is the number of coefficients used by the subtree rooted at i. The size of 
each table is thus 0(min{2 D: > , B}R) where j is the level of the tree to which i belongs. When computing an entry 
E[i, b, v] in the table, we need to choose the best non-zero subset S of the 2 D — 1 coefficients that belong to the 
node and the best assignment of values to these \S\ coefficients. These choices contribute a factor 0((2R) 2 _1 ) to 
the time complexity. We also have to choose the best partition of the remaining b — \S\ coefficients into 2 D parts 
adding another 0(B 2 ) factor to the running time. We can avoid the latter factor by ordering the search among the 
node's children as in [12], [18]. Each node is broken into 2 D — 1 subnodes: Suppose node i has children ci, . . . , c 2 d 
ordered in some manner. Then subnode i t , will have c t as its left child and subnode i t -\ as its right child. Subnode 
i 2 D_i will have c 2 r>_ 1 and c 2 d as its children. Now all subnode i t needs to do is search for the best partition 
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of b into 2 parts as usual. Specifically, fix S and the values given to the coefficients in S. For each v, b' with 
< b' < niin{2 13: ', 6— 15*1}, each subnode starting from £2^-1 computes the best allotment of b' coefficients to its 
children. This process takes 0(R(min{2 D: > , B}) 2 ) time per subnode. For t^ the bounds are better. All the error 
arrays for the subnodes are discarded before considering the next choice of S and values assigned to its elements. 
Hence, assuming the input is of size N, and since there are N/2 D ^ nodes per level of the coefficient tree, the total 
running time is 

£ ^-(2i?) 2l, - 1 2^i?(n 1 in{2^,B}) 2 =0{NBI? D ) 



where we dropped the constant factors involving D in the final expression. Finally recall from Section |IV-A| that 
we need to make 0(log N) guesses for the error £ . 

B. QPTAS for General Compact Systems 

We show a simple dynamic programming algorithm that finds a (1 + e)-approximation to the wavelet synopsis 
construction problem under the l^ norm. The algorithm uses g(q, n) — 7i ( 9 ( los<?+loglog ™)) time and space. Under 
the £ p norm, the algorithm uses n ^ og9+ p '' time and space. We will describe the algorithm for the Daubechies 
wavelet under the l^ norm. Recall that the Daubechies filters have 2q non-zero coefficients. 

For a given subproblem, call an edge an interface edge if exactly one of its endpoints is in the subproblem. Each 
interface edge has a value associated with it which is eventually determined at a later stage. We will maintain that 
each subproblem has at most 4glogn interface edges. A subproblem has a table E associated with it where for 
each b < B and each configuration / of values on interface edges, E[b,I] stores the minimum contribution to the 
overall error when the subproblem uses b coefficients and the interface configuration is /. From Lemma [15] setting 

p = e£/(ciqlogn) for some suitably large constant c\, each interface edge can have one of V = 0(- 2SJ1} 

values under the £oo norm. Hence, the size of E is bounded by BV 4ql ° sn = g(q,n). 

The algorithm starts with an initialization phase that creates the first subproblem. This phase essentially flattens 
the cone-shape of the coefficient graph, and the only difference between it and later steps is that it results in one 
subproblem as opposed to two. We select any 2q consecutive leaves in the coefficient graph and their ancestors. 

This is at most 2glogn nodes. We will guess the coefficients of the optimal solution associated with this set of 

§3/2 , 
each coefficient can take one of W = 0(- — ^ og ) values under the l^ norm. For 

each of the (2W) 2qy ° gn = g(q,n) guesses, we will run the second phase of the algorithm. 

In the second phase, given a subproblem A, we first select the 2q 'middle' leaves and their ancestors. Call this 

strip of nodes S. Note that \S\ < 2qlogn. The nodes in S break A into two smaller subproblems L and R (see 

Fig. U\. Suppose we have El and En, the two error arrays associated with L and R respectively. We compute each 

entry £^[6, i] as follows. First, we guess the b' non-zero coefficients of the optimal solution associated with the 

nodes in S and their values. Combined with the configuration /, these values define a configuration Tj, (resp. Ir) 

for the interface edges of L (resp. R) in the obvious way. Furthermore, they result in an error e associated with 
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Fig. 7: An example subproblem. The shaded nodes belong to the strip S. The edges crossing the 'frontier' are interface edges. 



the leaf nodes in S. Hence, 



E[b, I] = e + mmm&x{E L [b", I L ],E R [b - b' - b",I R }} . 

b" 

Therefore, computing each entry in E takes at most B(2W) 2ql ° sn = g(q, n) time. The running time of the algorithm 
follows. 

Theorem 19: We can compute a (1+e) approximation to the best B-term unrestricted representation of a compact 
system under the i x norm in time ri O(?(log«+loglogn)) i 

The result also extends to l p norms, but remains a quasi-polynomial time algorithm. The main point of the above 
theorem is that the representation problem is not Max-SNP-Hard. 

C. Workloads 

The algorithm and analysis from Section llV] also extend to weighted cases/workloads under the same assumptions 
as in [16]. Namely, given / and {wi} where 2~27=i w* = 1 and < Wi < 1, we wish to find a solution {zi} with 
at most B non-zero coefficients that minimizes, 

= (£ X /0-)-£.^(j) pJ " 

p,w V* — 'j * — '1 



f- 



Zilpi 



Letting w = mini i"i and W = max^ Wi, we will show how our approximation algorithm extends to this case with 



■w* 



a factor ^- increase in its space requirement and a factor 1-^-1 increase in running time. 

The following three lemmas are analogs of lemmas [TT] [T4| and [T2| respectively. The first two are straightforward, 
but note the factor W in the additive approximation. 

Lemma 20: Let {y*} be the optimal solution using the basis set {V^ 6 } for the reconstruction, i.e., / = J^ y*4>i 
and ||/ — /||p, u, = £ ■ Let {yf } be the set where each y* is rounded to the nearest multiple of p. If f p = J^ i J/fV'i' 
then ||/ - fP\\ Pt „ <£ + 0(qWn l /Pplogn). 

Lemma 21: -C x ^q% < (/, <% s ) - (/, ^J < C x Jq% for some constant C x . 

Lemma 22: ~C 0y /q^ < {f,^j s ) - V* < C 0y /q^ for some constant C . 
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Proof: For all j we have Wj\f(j) — J2i Vii^iU)] — £■ Multiplying by IV'fcGOl ar, d summing over all j we get 

Y1 w Mj)Mj) -Y,y^^™ ^ will* 

3 » 

E/W*(i)-E»*E^W*(i)l < \mu 



3 

w\ 



J * J 

=* H(/,^}-2/fcl< ^ , 

completing the proof. ■ 

Hence, setting p = e£/(cgWn 1 / p logn) for some suitably large constant c, we get the desired approximation with 
R from the analysis above equal to 0(£/(wp)) = 0(—qe~ 1 n 1 / p logn). 

D. Quality Versus Time 

A natural question arises, if we were interested in the restricted synopses only, can we develop streaming 
algorithms for them? The answer reveals a rich tradeoff between synopsis quality and running time. 

Observe that if at each node we only consider either storing the coefficient or 0, then we can limit the search 
significantly. Instead of searching over all v + r to the left and v — r to the right in the dynamic program (which 

we repeat below) 

{min rj ;,/ E\il,v + r, b'] + E[in, v — r, b — b' — 1] 
min b / E[i L ,v, b'} + E[i R , v,b- b'} 
we only need to search for r = (/, ipf) — observe that a streaming algorithm can compute (/, ipf) (See [24]). 
However we have to "round" (/, ijjf) to a multiple of p since we are storing the table corresponding to the 
multiples of p between (/, (f>f) — C\^fq£ and (/, (f>f) + C\^fq£. We consider the better of rounding up or rounding 
down (/, ipf) to the nearest multiple of p. The running time improves by a factor of R in this case since in order 
to compute each entry we are now considering only two values of Ri (round up/down) instead of the entire set. 
The overall running time is O(RnB) in the general case and 0(Rn log 2 B) for the i^ variants. The space bound 
and the approximation guarantees remain unchanged. However the guarantee is now against the synopsis which is 
restricted to storing wavelet coefficients. 

The above discussion sets the ground for investigating a variety of Hybrid algorithms where we choose different 
search strategies for each coefficient. We introduced this idea in [43] but in the context of a weaker approximation 
strategy. One strategy we explore in Section fVTl] is to allow the root node to range over the set Ri while considering 
the better of rounding up or rounding down (/, ipf) to the nearest multiple of p for all other coefficients (i > 1). 
We show that this simple modification improves on the quality of the restricted synopsis and on the running time 
of the unrestricted algorithm. 

VI. Best Basis Selection from a Dictionary 

In this section we show how our algorithms can be extended to find representations in certain types of tree- 
structured dictionaries. Specifically, the dictionaries we consider are full binary tree-structured dictionaries composed 
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of compactly supported wavelets. Given / G R", B e Z and such a dictionary T>, we now wish to find the best 
S-term representation of / in a basis from T>. Notice that we seek both the best basis in T> for representing / using 
B terms and the best _B-term representation of / in this basis. The error of the representation is its £ p distance from 



/. We show in Theorem 25 how our algorithms from the previous sections can be used to find provably-approximate 
answers to this bi-criteria optimization problem. 

We start with the description of our tree-structured dictionaries. Similar to Coiffman and Wickerhauser [21], our 
dictionaries will be composed of 0(n log n) vectors, and will contain 2°(t) bases: equal to the number of cuts in 
a complete binary tree. 

Let a(j >p ) = 2 J p and let g (j>) [t] = l[ 0(iiJ>)l o , p+1) -i][*] be the discrete dyadic window that is 1 in [a,( jtP) , ay iP+1) - 
1] and zero elsewhere. Each node in D is labeled by (j,p), < j < logn, < p < n2~' J — 1, where j is the 
height of the node in the tree (the root is at height log n), and p is the number of nodes to its left that are at the 
same height in a complete binary tree. With each node (j,p) we associate the subspace W(j )P ) of W 1 that exactly 
includes all functions / G W l whose support lies in £f( J)P ). Clearly, Wa O gn,0) = K n - 

Now suppose {ek,i}o<k<i, I > is an orthonormal basis for R\ Then, 

% P ) = {^fc,(i,p)M = 9{j, P )[t}e k:23 [t - a(j, P )]} < fe<2J , 

is an orthonormal basis for W(j iP y 

Proposition 23: For any internal node (j,p) in the dictionary T>, W(j-i,2 P ) an d Wfj_i 2p+i) m ' e orthogonal, and 

w a ,p) = w -_ li2p) ew (j -_i,2p+i) • 

We can thus construct an orthonormal basis of Wfyjp) via a union of orthonormal bases of W(j-i t 2p) ar, d W(j_i,2 P +i)' 
Corollary 24: Let {(ji,Pi)} be the set of nodes corresponding to a cut in the dictionary tree. We have, 

Hence, there are 0{2 n / 2 ) bases in our dictionary. 

The main result of this section follows. We prove it under the (.^ error measure. The argument is extended to 
general l p error measures in a straightforward manner. 

Theorem 25: If A is an (streaming) algorithm that achieves a C-approximation for the S-term representation 
problem under i^ (for any wavelet included in the dictionary T>), then A is a (streaming) C-approximation for the 
bi-criteria representation problem. 

Proof: Let -E(j, p ) [b] be the minimum contribution to the overall error (as computed by A) from representing 
the block <?(j. p ) [£]/[£] using b vectors from a basis of WYj- iP ). Call the basis that achieves this error the best basis for 



Wu tP ) and denote it by Ou p \[b]. By Proposition 23 there are 0(2 2 ) possible bases for the space W(j p ) in T>. 



Now if (j,p) is a leaf node, then E<j tP -\ [b] = A{g(j p -\ /, Bu tP \ , b), which is the error resulting from representing 
the block <?(j, p ) [£]/[£] using b vectors from the basis t3(j ;P y Otherwise, if (j,p) is an internal node, £ , ( Jp )[6] equals 



mm < 



min 6 max{£ _ li2p )[fe'], E {j _ lt2p+ i)[b - b']} 
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and 

J By*) if %p) H = A (30",p) © /> %p) . 6 ) 

[ 0(]-i,2p) [&0»] u °0'-i,2p-i) [ & - &G»] else ' 
where 6(j lP ) is the argument that minimizes the top expression in E/j }P \[b], 

Suppose Opt chooses the cut {(jo,Po)} with the corresponding partition {b } of B and we choose the cut 
{(ji,Pi)} with partition {bi}. By the dynamic program above we have, 

maxyl(0(j. jP .) /,B(j<, Pi ),fy) = max E( jitPi )[bi] (6a) 

<max£(j o]Po )[6(,] (6b) 

< max A(g (jo7Po ) /, B(j otPo ),b ) (6c) 

< CmaxOPT(s- (jo)Po ) /, B(j , Po ),b ) (6d) 
= COpt , (6e) 



where ( |6b] l follows from the fact that our dynamic program chooses the best cut and corresponding partition of B 
among all possible cuts and partitions based on the errors computed by algorithm A; ( |6c| ) follows from the definition 
of our dynamic programming table entries E(j iP )[b]; ( |6c| l follows from the assumption that A is a C-approximation 
algorithm; and (J6eJ follows from the optimal substructure property of our problem. ■ 

VII. Comparing Restricted and Unrestricted optimizations 

We consider two issues in this section, namely (i) the quality of the unrestricted version vis-a-vis the restricted 
optimum solution and (ii) the running times of the algorithms. We will restrict our experiments to the d^ norm. 

A. The algorithms 

All experiments reported in this section were performed on a 2 CPU Pentium-Ill 1.4 GHz with 2GB of main 
memory, running Linux. All algorithms were implemented using version 3.3.4 of the gcc compiler. 
We show the performance figures of the following schemes: 

REST This characterizes the algorithms for the restricted version of the problem. This is the 0(n 2 ) time 0(n) 

space algorithm in [17] (see also [14], [15], [18]). 



UNREST This is the streaming algorithm for the full general version described in Algorithm HaarPTAS based 
on the discussion in Section |IV|N 



HYBRID This is the streaming hybrid algorithm proposed in Section V-D 
Note that the UNREST and HYBRID algorithms are not the additive approximation algorithms in [43] (although 
we kept the same names). 



t^The implementation is available from http: //www.cis .upenn.edu/~boulos 
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(a) Error for the Saw data set (n = 2048) 



(b) Error for the Dow data set (n = 16384) 



Fig. 8: The i x error of the three algorithms, UNREST, REST, and HYBRID for the two data sets. 



B. The Data Sets 

We chose a synthetic data set to showcase the point made in the introduction about the sub-optimality of the 
restricted versions. Otherwise we use a publicly available real life data set for our experiment. 

• Saw: This is a periodic data set with a line repeated 8 times, with 2048 values total. 

• DJIA data set: We used the Dow- Jones Industrial Average (DJIA) data set available at StatLitpjthat contains 
Dow -J ones Industrial Average (DJIA) closing values from 1900 to 1993. There were a few negative values 
(e.g. —9), which we removed. We focused on prefixes of the data set of sizes up to 16384. 

C. Quality of Synopsis 

The £oo errors as a function of B are shown in figures 8(a) and 8(b) The e in the approximation algorithms 



UNREST and HYBRID was set to 1. All the algorithms gave very similar synopses for the Saw data and had almost 
the same errors. In case of the Dow data we show the range B = 5 onward since the maximum value is ~ 500 and 
the large errors for B < 5 (for all algorithms) bias the scale making the differences in the more interesting ranges 
not visible. The algorithm REST has more than 20% worse error compared to UNREST or requires over 35% more 
coefficients to achieve the same error (for most error values). The HYBRID algorithm performs consistently in the 
middle. 

D. Running Times 

Figure [9] shows the running times of the algorithms as the prefix size n is varied for the Dow data. As mentioned 
above e was set to 1. The grid in the log-log plot helps us clearly identify the quadratic nature of REST. The 
algorithms UNREST and HYBRID behave linearly as is expected from streaming algorithms. Given its speed and 
quality, the HYBRID algorithm seems to be the best choice from a practical perspective. 

W See http://lib.stat.cmu.edu/datasets/djdc0093. 



DRAFT 



35 




92 16384 



Fig. 9: Running times for prefixes of the Dow data set 
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